Abstract



The problem of selecting the right state-representation in a reinforcement learning
problem is considered. Several models (functions mapping past observations to
a finite set) of the observations are given, and it is known that for at least one of
these models the resulting state dynamics are indeed Markovian. Without knowing
either which of the models is the correct one or what the probabilistic
characteristics of the resulting MDP are, it is required to obtain as much reward as the
optimal policy for the correct model (or for the best of the correct models, if there
are several). We propose an algorithm that achieves that, with a regret of order
$T^{2/3}$, where $T$ is the horizon time.



1 Introduction 



We consider the problem of selecting the right state-representation in an average-reward reinforcement
learning problem. Each state-representation is defined by a model $\phi_j$ (to which corresponds a
state space $\mathcal{S}_{\phi_j}$) and we assume that the number $J$ of available models is finite and that (at least) one
model is a weakly-communicating Markov decision process (MDP). We do not make any assumption
at all about the other models. This problem is considered in the general reinforcement learning
setting, where an agent interacts with an unknown environment in a single stream of repeated
observations, actions and rewards. There are no "resets," thus all the learning has to be done online.
Our goal is to construct an algorithm that performs almost as well as the algorithm that knows both
which model is an MDP (knows the "true" model) and the characteristics of this MDP (the transition
probabilities and rewards).

Consider some examples that help motivate the problem. The first example is high-level feature 
selection. Suppose that the space of histories is huge, such as the space of video streams or that of 
game plays. In addition to these data, we also have some high-level features extracted from it, such 
as "there is a person present in the video" or "the adversary (in a game) is aggressive." We know that 
most of the features are redundant, but we also know that some combination of some of the features 
describes the problem well and exhibits Markovian dynamics. Given a potentially large number 
of feature combinations of this kind, we want to find a policy whose average reward is as good as 
that of the best policy for the right combination of features. Another example is bounding the order
of an MDP. The process is known to be $k$-order Markov, where $k$ is unknown but an upper bound
$K \gg k$ is given. The goal is to perform as well as if we knew $k$. Yet another example is selecting
the right discretization. The environment is an MDP with a continuous state space. We have several
candidate quantizations of the state space, one of which gives an MDP. Again, we would like to find
a policy that is as good as the optimal policy for the right discretization. This example also opens
the way for extensions of the proposed approach: we would like to be able to treat an infinite set
of possible discretizations, none of which may be perfectly Markovian. The present work can be
considered the first step in this direction. 

It is important to note that we do not make any assumptions on the "wrong" models (those that do 
not have Markovian dynamics). Therefore, we are not able to test which model is Markovian in the 
classical statistical sense, since in order to do that we would need a viable alternative hypothesis 
(such as: the model is not Markov but is $k$-order Markov). In fact, the constructed algorithm never
"knows" which model is the right one; it is "only" able to get the same average level of reward as if 
it knew. 

Previous work. This work builds on previous work on learning average-reward MDPs. Namely, 
we use in our algorithm as a subroutine the algorithm UCRL2 of [6] that is designed to provide
finite time bounds for undiscounted MDPs. Such a problem has been pioneered in the reinforcement 
learning literature by \2] and then improved in various ways by [5] HT] [TJ] IS O ; UCRL2 achieves a 
regret of the order $D T^{1/2}$ in any weakly-communicating MDP with diameter $D$, with respect to the
best policy for this MDP. The diameter $D$ of an MDP is defined in [6] as the expected minimum time
required to reach any state starting from any other state. A related result is reported in tJJ, which 
improves on constants related to the characteristics of the MDP. 

A similar approach has been considered in [10]; the difference is that in that work the probabilistic
characteristics of each model are completely known, but the models are not assumed to be Marko- 
vian, and belong to a countably infinite (rather than finite) set. 

The problem we address can be also viewed as a generalization of the bandit problem (see e.g. (|9] 
m [T]): there are finitely many "arms", corresponding to the policies used in each model, and one 
of the arms is the best, in the sense that the corresponding model is the "true" one. In the usual 
bandit setting, the rewards are assumed to be i.i.d. thus one can estimate the mean value of the arms 
while switching arbitrarily from one arm to the next (the quality of the estimate only depends on the 
number of pulls of each arm). However, in our setting, estimating the average-reward of a policy 
requires playing it many times consecutively. This can be seen as a bandit problem with dependent
arms, with complex costs of switching between arms. 

Contribution. We show that despite the fact that the true Markov model of states is unknown 
and that nothing is assumed on the wrong representations, it is still possible to derive a finite-time 
analysis of the regret for this problem. This is stated in Theorem 1; the bound on the regret that we
obtain is of order $T^{2/3}$.

The intuition is that if the "true" model $\phi^*$ is known, but its probabilistic properties are not, then we
still know that there exists an optimal control policy that depends on the observed state $s_{t,\phi^*}$ only.
Therefore, the optimal rate of rewards can be obtained by a clever exploration/exploitation strategy,
such as the UCRL2 algorithm [6]. Since we do not know in advance which model is an MDP, we need
to explore them all, for a sufficiently long time in order to estimate the rate of rewards that one can
get using a good policy in that model.

Outline. In Section 2 we introduce the precise notion of model and set up the notations. Then we
present the proposed algorithm in Section 3; it uses UCRL2 of [6] as a subroutine and selects the
models according to a penalized empirical criterion. In Section 4 we discuss some directions for
further development. Finally, Section 5 is devoted to the proof of Theorem 1.

2 Notation and definitions 

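We consider a space of observations $\mathcal{O}$, a space of actions $\mathcal{A}$, and a space of rewards $\mathcal{R}$ (all assumed
to be Polish). Moreover, we assume that $\mathcal{A}$ is of finite cardinality $A := |\mathcal{A}|$ and that $0 \in \mathcal{R} \subset [0, 1]$.
The set of histories up to time $t$ for all $t \in \mathbb{N} \cup \{0\}$ will be denoted by $\mathcal{H}_{<t} := \mathcal{O} \times (\mathcal{A} \times \mathcal{R} \times \mathcal{O})^{t-1}$
and we define the set of all possible histories by $\mathcal{H} := \bigcup_{t=1}^{\infty} \mathcal{H}_{<t}$.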

Environments. For a Polish space $X$, we denote by $\mathcal{P}(X)$ the set of probability distributions over $X$.
Define an environment to be a mapping from the set of histories $\mathcal{H}$ to the set of functions that map
any action $a \in \mathcal{A}$ to a probability distribution $\nu_a \in \mathcal{P}(\mathcal{R} \times \mathcal{O})$ over the product space of rewards
and observations.

We consider the problem of reinforcement learning when the learner interacts with some unknown
environment $e^*$. The interaction is sequential and goes as follows: first some $h_{<1} = \{o_0\}$ is generated
according to an initial distribution $\iota$, then at time step $t > 0$, the learner chooses an action $a_t \in \mathcal{A}$ according to the
current history $h_{<t} \in \mathcal{H}_{<t}$. Then a pair of reward and observation $(r_t, o_t)$ is drawn according
to the distribution $(e^*(h_{<t}))_{a_t} \in \mathcal{P}(\mathcal{R} \times \mathcal{O})$. Finally, $h_{<t+1}$ is defined by the concatenation of $h_{<t}$
with $(a_t, r_t, o_t)$. With these notations, at each time step $t > 0$, $o_{t-1}$ is the last observation given
to the learner before choosing an action, $a_t$ is the action output at this step, and $r_t$ is the immediate
reward received after playing $a_t$.
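
The interaction protocol can be sketched in Python as follows. This is only a minimal sketch: the names `interact`, `env_step` and `policy` are hypothetical, and the toy environment and policy in the usage lines are assumptions introduced purely so that the loop can be run. The history is stored as a flat list $[o_0, a_1, r_1, o_1, \dots]$, so its last element is always the last observation.

```python
import random

def interact(env_step, policy, first_obs, horizon):
    """Run the sequential protocol for `horizon` steps.

    The history starts as [o_0]; at each step the learner picks an action
    from the current history, then the environment returns (reward, obs).
    """
    history = [first_obs]                       # h_{<1} = (o_0)
    rewards = []
    for _ in range(horizon):
        action = policy(history)                # a_t depends on h_{<t} only
        reward, obs = env_step(history, action) # (r_t, o_t) drawn given h_{<t}, a_t
        history.extend([action, reward, obs])   # h_{<t+1} = h_{<t} + (a_t, r_t, o_t)
        rewards.append(reward)
    return history, rewards

rng = random.Random(0)

def toy_env_step(history, action):
    # Assumed toy dynamics: reward 1 if the action equals the last observation,
    # next observation drawn uniformly from {0, 1}.
    reward = 1.0 if action == history[-1] else 0.0
    return reward, rng.choice([0, 1])

def copy_last_obs(history):
    # Assumed toy policy: replay the last observation as the action.
    return history[-1]

history, rewards = interact(toy_env_step, copy_last_obs, first_obs=0, horizon=5)
```

After `horizon` steps the history has length $1 + 3 \cdot$ `horizon`, matching the product structure of the set of histories.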

State representation functions (models). Let $\mathcal{S} \subset \mathbb{N}$ be some finite set; intuitively, this has to be
considered as a set of states. A state representation function $\phi$ is a function from the set of histories
$\mathcal{H}$ to $\mathcal{S}$. For a state representation function $\phi$, we will use the notation $\mathcal{S}_\phi$ for its set of states, and
$s_{t,\phi} := \phi(h_{<t})$.

In the sequel, when we talk about a Markov decision process, it will be assumed to be weakly
communicating, which means that for each pair of states $u_1, u_2$ there exists $k \in \mathbb{N}$ and a sequence
of actions $\alpha_1, \dots, \alpha_k \in \mathcal{A}$ such that $P(s_{k+1,\phi} = u_2 \mid s_{1,\phi} = u_1, a_1 = \alpha_1, \dots, a_k = \alpha_k) > 0$. Having
that in mind, we introduce the following definition.

Definition 1 We say that an environment $e$ with a state representation function $\phi$ is Markov, or, for
short, that $\phi$ is a Markov model (of $e$), if the process $(s_{t,\phi}, a_t, r_t), t \in \mathbb{N}$ is a (weakly communicating)
Markov decision process.

For example, consider a state-representation function $\phi$ that depends only on the last observation,
and that partitions the observation space into finitely many cells. Then an environment is Markov 
with this representation function if the probability distribution on the next cells only depends on the 
last observed cell and action. Note that there may be many state-representation functions with which 
an environment e is Markov. 
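
A state-representation function of the kind described in this example can be sketched in Python. The sketch assumes, purely for illustration, that observations are real numbers in $[0, 1]$ and that a history is a flat list whose last element is the last observation; the name `make_cell_model` is hypothetical.

```python
def make_cell_model(num_cells):
    """Return a function phi mapping a history to the cell of its last observation.

    Assumption: observations lie in [0, 1], partitioned into `num_cells`
    equal-width cells indexed 0 .. num_cells - 1.
    """
    def phi(history):
        last_obs = history[-1]
        # The observation 1.0 is assigned to the last cell.
        return min(int(last_obs * num_cells), num_cells - 1)
    return phi

phi = make_cell_model(4)
# Two histories: one with a single observation, one after a full step.
states = [phi([0.1]), phi([0.1, 1, 0.0, 0.95])]
```

Several such functions with different values of `num_cells` would form a candidate set of models; whether any of them is Markov depends on the environment.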

3 Main results 

Given a set $\Phi := \{\phi_j ; j \le J\}$ of $J$ state-representation functions (models), one of which is
a Markov model of the unknown environment $e^*$, we want to construct a strategy that performs
nearly as well as the best algorithm that knows which model is Markov, and knows all the probabilistic
characteristics (transition probabilities and rewards) of the MDP corresponding to this model. For
that purpose we define the regret of any strategy at time $T$, as in [6, 3], as

$$\Delta(T) := T\rho^* - \sum_{t=1}^{T} r_t ,$$

where $r_t$ are the rewards received when following the proposed strategy and $\rho^*$ is the average optimal
value in the best Markov model, i.e., $\rho^* = \lim_{T} \mathbb{E}\big(\frac{1}{T} \sum_{t=1}^{T} r_t(\pi^*)\big)$, where $r_t(\pi^*)$ are the rewards
received when following the optimal policy for the best Markov model. Note that this definition
makes sense since when the MDP is weakly communicating, the average optimal value of reward
does not depend on the initial state. Also, one could replace $T\rho^*$ with the expected sum of rewards
obtained in $T$ steps (following the optimal policy) at the price of an additional $O(\sqrt{T})$ term.

In the next subsection, we describe an algorithm that achieves a sub-linear regret of order $T^{2/3}$.

3.1 Best Lower Bound (BLB) algorithm

In this section, we introduce the Best-Lower-Bound (BLB) algorithm, described in Figure 1.

The algorithm works in stages of doubling length. Each stage consists of 2 phases: an exploration
and an exploitation phase. In the exploration phase, BLB plays the UCRL2 algorithm on each
model $(\phi_j)_{1 \le j \le J}$ successively, as if each model $\phi_j$ was a Markov model, for a fixed number $\tau_{i,1,j}$
of rounds. The exploitation part consists in selecting first the model with highest lower bound,
according to the empirical rewards obtained in the previous exploration phase. This model is initially
selected for the same time as in the exploration phase, and then a test decides whether to continue
playing this model (if its performance during exploitation is still above the corresponding lower
bound, i.e. if the rewards obtained are still at least as good as if it was playing the best model) or to
discard it. If the model does not pass the test, then another model (with second best lower bound) is
selected and played, and so on, until the exploitation phase (of fixed length $\tau_{i,2}$) finishes and the next stage starts.

Parameters: $f$, $\delta$

For each stage $i \ge 1$ do

Set the total length of stage $i$ to be $\tau_i := 2^i$.

1. Exploration. Set $\tau_{i,1} = \tau_i^{2/3}$. For each $j \in \{1, \dots, J\}$ do

- Run UCRL2 with parameter $\delta_i(\delta)$ defined in (1) using $\phi_j$ during $\tau_{i,1,j}$
steps: the state space is assumed to be $\mathcal{S}_{\phi_j}$ with transition structure
induced by $\phi_j$.

- Compute the corresponding average empirical reward $\hat\mu_{i,1}(\phi_j)$ received
during this exploration phase.

2. Exploitation. Set $\tau_{i,2} = \tau_i - \tau_{i,1}$ and initialize $\mathcal{J} := \{1, \dots, J\}$.
While the current length of the exploitation part is less than $\tau_{i,2}$ do

- Select $\hat j = \operatorname{argmax}_{j \in \mathcal{J}} \hat\mu_{i,1}(\phi_j) - 2B(i, \phi_j, \delta)$ (using (3)).

- Run UCRL2 with parameter $\delta_i(\delta)$ using $\phi_{\hat j}$: update at each time step $t$
the current average empirical reward $\hat\mu_{i,2,t}(\phi_{\hat j})$ from the beginning of
the run. Provided that the length of the current run is larger than $\tau_{i,1,j}$,
do the test

$\hat\mu_{i,2,t}(\phi_{\hat j}) \ge \hat\mu_{i,1}(\phi_{\hat j}) - 2B(i, \phi_{\hat j}, \delta)$.

- If the test fails, then stop UCRL2 and set $\mathcal{J} := \mathcal{J} \setminus \{\hat j\}$. If $\mathcal{J} = \emptyset$
then set $\mathcal{J} := \{1, \dots, J\}$.

Figure 1: The Best-Lower-Bound selection strategy.

The length of stage $i$ is fixed and defined to be $\tau_i := 2^i$. Thus for a total time horizon $T$, the number
of stages $I(T)$ before time $T$ is $I(T) := \lfloor \log_2(T + 1) \rfloor$. Each stage $i$ (of length $\tau_i$) is further
decomposed into an exploration phase (length $\tau_{i,1}$) and an exploitation phase (length $\tau_{i,2}$).
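
The stage schedule can be illustrated with a short Python sketch. It assumes the choices $\tau_i = 2^i$, $\tau_{i,1} = \tau_i^{2/3}$, $\tau_{i,1,j} = \tau_{i,1}/J$ and $\tau_{i,2} = \tau_i - \tau_{i,1}$ used by BLB; the function names are hypothetical, and the lengths are kept real-valued (how a practical implementation rounds them to integers is left open here).

```python
def stage_schedule(i, num_models):
    """Return (tau_i, tau_{i,1}, tau_{i,1,j}, tau_{i,2}) for stage i."""
    tau = 2.0 ** i                            # total stage length
    tau_explore = tau ** (2.0 / 3.0)          # total exploration length
    tau_per_model = tau_explore / num_models  # exploration length per model
    tau_exploit = tau - tau_explore           # exploitation length
    return tau, tau_explore, tau_per_model, tau_exploit

def num_stages(horizon):
    """I(T) = floor(log2(T + 1)), computed exactly on integers."""
    return (horizon + 1).bit_length() - 1

# Example: stage 3 with J = 2 models, and the number of stages for T = 1000.
tau, tau_explore, tau_per_model, tau_exploit = stage_schedule(3, 2)
stages_for_1000 = num_stages(1000)
```

Using `bit_length` avoids floating-point rounding in the logarithm when $T + 1$ is close to a power of two.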

Exploration phase. All the models $\{\phi_j\}_{j \le J}$ are played one after another for the same amount of
time $\tau_{i,1,j} := \tau_{i,1}/J$. Each episode $1 \le j \le J$ consists in running the UCRL2 algorithm using the
model of states and transitions induced by the state-representation function $\phi_j$. Note that UCRL2
does not require the horizon $T$ in advance, but requires a parameter $p$ in order to ensure a near
optimal regret bound with probability higher than $1 - p$. We define this parameter $p$ to be $\delta_i(\delta)$ in
stage $i$, where

$$\delta_i(\delta) := \big(2^i - (J^{-1} + 1) 2^{2i/3} + 4\big)^{-1} 2^{-i+1} \delta . \qquad (1)$$

The average empirical reward received during each episode is written $\hat\mu_{i,1}(\phi_j)$.
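
The role of the normalisation in the per-stage confidence parameter can be checked numerically. The Python sketch below assumes the reading $\delta_i(\delta) = (2^i - (J^{-1}+1) 2^{2i/3} + 4)^{-1} 2^{-i+1} \delta$ of equation (1) and the stage lengths $\tau_{i,1} = 2^{2i/3}$, $\tau_{i,1,j} = \tau_{i,1}/J$, $\tau_{i,2} = 2^i - \tau_{i,1}$; the function names are hypothetical. Under these assumptions the denominator equals $\tau_{i,2} - \tau_{i,1,j} + 4$, which is the number of confidence events used per stage in the proof, so the failure probability of stage $i$ is $2^{-i+1}\delta$.

```python
def delta_i(i, num_models, delta):
    """Confidence parameter of stage i, as read from equation (1)."""
    denom = 2.0 ** i - (1.0 / num_models + 1.0) * 2.0 ** (2.0 * i / 3.0) + 4.0
    return 2.0 ** (-i + 1) * delta / denom

def stage_failure_budget(i, num_models, delta):
    """(tau_{i,2} - tau_{i,1,j} + 4) * delta_i(delta) for stage i."""
    tau = 2.0 ** i
    tau_explore = tau ** (2.0 / 3.0)
    n_events = (tau - tau_explore) - tau_explore / num_models + 4.0
    return n_events * delta_i(i, num_models, delta)

# Example: failure budget of stage 5 with J = 3 models and delta = 0.05.
budget_stage_5 = stage_failure_budget(5, 3, 0.05)
```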

Exploitation phase. We use the empirical rewards $\hat\mu_{i,1}(\phi_j)$ received in the previous exploration
part of stage $i$ together with a confidence bound in order to select the model to play. Moreover, a
model $\phi$ is no longer run for a fixed period of time (as in the exploration part of stage $i$), but for a
period $\tau_{i,2}(\phi)$ that depends on some test; we first initialize $\mathcal{J} := \{1, \dots, J\}$ and then choose

$$\hat j := \operatorname{argmax}_{j \in \mathcal{J}} \hat\mu_{i,1}(\phi_j) - 2B(i, \phi_j, \delta) , \qquad (2)$$

where we define

$$B(i, \phi, \delta) := 34 f(\tau_i - 1 + \tau_{i,1}) |\mathcal{S}_\phi| \sqrt{\frac{A \log(\tau_{i,1,j}/\delta_i(\delta))}{\tau_{i,1,j}}} , \qquad (3)$$

where $\delta$ and the function $f$ are parameters of the BLB algorithm. Then UCRL2 is played using the
selected model $\phi_{\hat j}$ for the parameter $\delta_i(\delta)$. In parallel we test whether the average empirical reward
we receive during this exploitation phase is high enough; at time $t$, if the length of the current episode
is larger than $\tau_{i,1,j}$, we test if

$$\hat\mu_{i,2,t}(\phi_{\hat j}) \ge \hat\mu_{i,1}(\phi_{\hat j}) - 2B(i, \phi_{\hat j}, \delta) . \qquad (4)$$

If the test is positive, we keep playing UCRL2 using the same model. Now, if the test fails, then the
model $\hat j$ is discarded (until the end of stage $i$), i.e. we update $\mathcal{J} := \mathcal{J} \setminus \{\hat j\}$, and we select a new one
according to (2). We repeat those steps until the total time $\tau_{i,2}$ of the exploitation phase of stage $i$ is
over.
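
The selection rule and the test of the exploitation phase can be sketched in Python. This is a hedged sketch of the bookkeeping only: the names `select_model`, `passes_test` and `discard` are hypothetical, the empirical rewards and the bonuses $B(i, \phi_j, \delta)$ are taken as given numbers, and running UCRL2 itself is not shown.

```python
def select_model(active, mu_explore, bonus):
    """Pick the active model with the highest lower bound mu - 2B."""
    return max(active, key=lambda j: mu_explore[j] - 2.0 * bonus[j])

def passes_test(mu_exploit_now, mu_explore_j, bonus_j):
    """Keep the model while its running average stays above its lower bound."""
    return mu_exploit_now >= mu_explore_j - 2.0 * bonus_j

def discard(active, j, all_models):
    """Remove model j; if no model is left, reset to the full set."""
    remaining = set(active) - {j}
    return remaining if remaining else set(all_models)

# Illustrative numbers (assumptions): model 0 has the best empirical reward
# but a wide confidence interval, so its lower bound is the worst.
mu_explore = {0: 0.9, 1: 0.6, 2: 0.7}
bonus = {0: 0.3, 1: 0.05, 2: 0.05}
chosen = select_model({0, 1, 2}, mu_explore, bonus)
```

With these numbers the lower bounds are roughly 0.3, 0.5 and 0.6, so the pessimistic rule picks model 2 rather than the model with the highest empirical reward.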

Remark. Note that the model selected for exploitation in (2) is the one that has the best lower bound.
This is a pessimistic (or robust) selection strategy. We know that if the right model is selected, then
with high probability, this model will be kept during the whole exploitation phase. If this is not the
right model, then either the policy provides good rewards and we should keep playing it, or it does
not, in which case it will not pass the test (4) and will be removed from the set of models that will
be exploited in this phase.

3.2 Regret analysis 

Theorem 1 (Main result) Assume that a finite set of $J$ state-representation functions $\Phi$ is given,
and there exists at least one function $\phi^* \in \Phi$ such that with $\phi^*$ as a state-representation function the
environment is a Markov decision process. If there are several such models, let $\phi^*$ be the one with
the highest average reward of the optimal policy of the corresponding MDP. Then the regret (with
respect to the optimal policy corresponding to $\phi^*$) of the BLB algorithm run with parameter $\delta$, for
any horizon $T$, with probability higher than $1 - \delta$ is bounded as follows

$$\Delta(T) \le c f(T) S \big(A J \log((J\delta)^{-1}) \log_2(T)\big)^{1/2} T^{2/3} + c' D S \big(A \log(\delta^{-1}) \log_2(T) T\big)^{1/2} + c(f, D) , \qquad (5)$$

for some numerical constants $c$, $c'$ and $c(f, D)$. The parameter $f(t)$ can be chosen to be any
increasing function; for instance, the choice $f(t) := \log_2 t + 1$ gives $c(f, D) \le 2^D$.

The proof of this result is reported in Section 5.

Remark. Importantly, the algorithm considered here does not know in advance the diameter $D$ of
the true model, nor the time horizon $T$. Due to this lack of knowledge, it uses a guess $f(t)$ (e.g.
$\log(t)$) on this diameter, which results in the additional regret term $c(f, D)$ and the additional factor
$f(T)$; knowing $D$ would make it possible to remove both of them, but this is a strong assumption. Choosing
$f(t) := \log_2 t + 1$ gives a bound which is of order $T^{2/3}$ in $T$ but is exponential in $D$; taking
$f(t) := t^{\epsilon}$ we get a bound of order $T^{2/3+\epsilon}$ in $T$ but of polynomial order $1/\epsilon$ in $D$.
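
The trade-off between the two choices of $f$ can be illustrated numerically. The Python sketch below assumes that the additive term is $c(f, D) = \sum_{i : f(t_i) < D} 2^i \rho^*$ with $t_i = 2^i - 1 + 2^{2i/3}$, as in the proof of Theorem 1, and bounds $\rho^*$ by 1; the function name `burn_in_cost` and the cap `max_stage` are hypothetical.

```python
import math

def burn_in_cost(f, diameter, max_stage=60):
    """Sum of stage lengths 2**i over the stages where the guess f(t_i) < D.

    Assumes rho* <= 1, so each such stage contributes at most 2**i.
    """
    total = 0.0
    for i in range(1, max_stage + 1):
        t_i = 2.0 ** i - 1.0 + 2.0 ** (2.0 * i / 3.0)
        if f(t_i) < diameter:
            total += 2.0 ** i
    return total

# Logarithmic guess: the cost grows exponentially with the diameter D.
cost_log = burn_in_cost(lambda t: math.log2(t) + 1.0, diameter=10)
# Polynomial guess t**0.5: the cost grows polynomially with D.
cost_poly = burn_in_cost(lambda t: t ** 0.5, diameter=10)
```

For $D = 10$ the logarithmic guess stays below $D$ during stages 1 to 8, while the guess $t^{1/2}$ only does so during stages 1 to 6, which is why its additive cost is smaller even though its multiplicative factor $f(T)$ is larger.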

4 Discussion and outlook 

Intuition. The main idea why this algorithm works is as follows. The "wrong" models are used 
during exploitation stages only as long as they are giving rewards that are higher than the rewards 
that could be obtained in the "true" model. All the models are explored sufficiently long so as 
to be able to estimate the optimal reward level in the true model, and to learn its policy. Thus,
nothing has to be known about the "wrong" models. This is in stark contrast to the usual situation 
in mathematical statistics, where to be able to test a hypothesis about a model (e.g., that the data is 
generated by a certain model versus some alternative models), one has to make assumptions about 
alternative models. This has to be done in order to make sure that the Type II error is small (the 
power of the test is large): that this error is small has to be proven under the alternative. Here, 
although we are solving seemingly the same problem, the role of the Type II error is played by the 
rewards. As long as the rewards are high we do not care whether the model we are using is correct or
not. We only have to ensure that the true model passes the test. 

Assumptions. A crucial assumption made in this work is that the "true" model $\phi^*$ belongs to a
known finite set. While passing from a finite to a countably infinite set appears rather straightfor- 
ward, getting rid of the assumption that this set contains the true model seems more difficult. What 
one would want to obtain in this setting is sub-linear regret with respect to the performance of the 
optimal policy in the best model; this, however, seems difficult without additional assumptions on 
the probabilistic characteristics of the models. Another approach not discussed here would be to try 
to build a good state representation function, as is suggested for instance in [5]. Yet another
interesting generalization in this direction would be to consider uncountable (possibly parametric 
but general) sets of models. This, however, would necessarily require some heavy assumptions on 
the set of models. 

Regret. The reader familiar with adversarial bandit literature will notice that our bound of order
$T^{2/3}$ is worse than the $T^{1/2}$ that usually appears in this context (see, for example [2]). The reason is
that our notion of regret is different: in adversarial bandit literature, the regret is measured with
respect to the best choice of the arm for the given fixed history. In contrast, we measure the regret
with respect to the best policy (which knows the correct model and its parameters) that, in general,
would obtain completely different (from what our algorithm would get) rewards and observations
right from the beginning.

Estimating the diameter? As previously mentioned, a possibly large additive constant $c(f, D)$
appears in the regret since we do not know a bound on the diameter of the MDP in the "true"
model, and use $\log T$ instead. Finding a way to properly address this problem by estimating online
the diameter of the MDP is an interesting open question. Let us provide two intuitions concerning
this problem. First, we notice that, as reported in [6], when we compute an optimistic model based
on the empirical rewards and transitions of the true model, the span of the corresponding optimistic
value function $\mathrm{sp}(\hat V^+)$ is always smaller than the diameter $D$. This span increases as we get more
reward and transition samples, which gives a natural empirical lower bound on $D$. However, it
seems quite difficult to compute a tight empirical upper bound on $D$ (or $\mathrm{sp}(\hat V^+)$). In [3], the authors
derive a regret bound that scales with the span of the true value function $\mathrm{sp}(V^*)$, which is also less
than $D$, and can be significantly smaller in some cases. However, since we do not have the property
that $\mathrm{sp}(\hat V^+) \le \mathrm{sp}(V^*)$, we need to introduce an explicit penalization in order to control the span of
the computed optimistic models, and this requires assuming we know an upper bound $B$ on $\mathrm{sp}(V^*)$
in order to guarantee a final regret bound scaling with $B$. Unfortunately this does not solve the
estimation problem of $D$, which remains an open question.

5 Proof of Theorem 1

In this section, we detail the proof of Theorem 1. The proof is stated in several parts. First we
recall a general confidence bound for the UCRL2 algorithm in the true model. Then we decompose
the regret into the sum of the regret in each stage $i$. After analyzing the contribution to the regret in
stage $i$, we then gather all stages and tune the length of each stage and episode in order to get the
final regret bound.



5.1 Upper and Lower confidence bounds 

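From the analysis of UCRL2 in [6], we have the property that with probability higher than $1 - \delta'$,
the average regret of UCRL2 when run for $\tau$ consecutive steps from time $t_1$ in the true model $\phi^*$ is
upper bounded by

$$\rho^* - \frac{1}{\tau} \sum_{t=t_1}^{t_1+\tau-1} r_t \le 34 D |\mathcal{S}_{\phi^*}| \sqrt{\frac{A \log(\tau/\delta')}{\tau}} , \qquad (6)$$

where $D$ is the diameter of the MDP. What is interesting is that this diameter does not need to be
known by the algorithm. Also, by carefully looking at the proof of UCRL2, it can be shown that the
following bound is also valid with probability higher than $1 - \delta'$:

$$\frac{1}{\tau} \sum_{t=t_1}^{t_1+\tau-1} r_t - \rho^* \le 34 D |\mathcal{S}_{\phi^*}| \sqrt{\frac{A \log(\tau/\delta')}{\tau}} .$$

We now define the following quantity, for every model $\phi$, episode length $\tau$ and $\delta' \in (0,1)$:

$$B_D(\tau, \phi, \delta') := 34 D |\mathcal{S}_\phi| \sqrt{\frac{A \log(\tau/\delta')}{\tau}} . \qquad (7)$$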
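
The confidence width can be evaluated numerically to see how it behaves with the episode length. The Python sketch below assumes the form $B_D(\tau, \phi, \delta') = 34 D |\mathcal{S}_\phi| \sqrt{A \log(\tau/\delta') / \tau}$, obtained by dividing the UCRL2 regret bound $34 D S \sqrt{A \tau \log(\tau/\delta')}$ of [6] by $\tau$; the function name and the example values are hypothetical.

```python
import math

def confidence_width(tau, num_states, num_actions, diameter, delta_prime):
    """Per-step confidence width B_D(tau, phi, delta') under the assumed form."""
    return (34.0 * diameter * num_states
            * math.sqrt(num_actions * math.log(tau / delta_prime) / tau))

# Example values (assumptions): 5 states, 2 actions, diameter 3, delta' = 0.1.
width_short = confidence_width(100, 5, 2, 3.0, 0.1)
width_long = confidence_width(1000, 5, 2, 3.0, 0.1)
```

The width shrinks as the episode gets longer, while the product $\tau B_D(\tau, \phi, \delta')$ grows with $\tau$, which is the monotonicity used later in the proof. The large constant 34 means the width only drops below 1 for long episodes.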



5.2 Regret of stage i 

In this section we analyze the regret of the stage $i$, which we denote $\Delta_i$. Note that since each stage
$i \le I(T)$ is of length $\tau_i = 2^i$ except the last one, which may stop earlier, we have

$$\Delta(T) = \sum_{i=1}^{I(T)} \Delta_i , \qquad (8)$$

where $I(T) = \lfloor \log_2(T+1) \rfloor$. We further decompose $\Delta_i = \Delta_{i,1} + \Delta_{i,2}$ into the regret $\Delta_{i,1}$ corresponding
to the exploration stage and the regret $\Delta_{i,2}$ corresponding to the exploitation stage.

Recall that $\tau_{i,1}$ is the total length of the exploration stage $i$ and $\tau_{i,2}$ is the total length of the exploitation
stage $i$. Then for each model $\phi$, we write $\tau_{i,1,j} := \tau_{i,1}/J$ for the number of consecutive steps during
which the UCRL2 algorithm is run with model $\phi$ in the exploration stage $i$, and $\tau_{i,2}(\phi)$ for the number
of consecutive steps during which the UCRL2 algorithm is run with model $\phi$ in the exploitation
stage $i$.

Good and Bad models. Let us now introduce the two following sets of models, defined after the
end of the exploration stage, i.e. at time $t_i$.

$$\mathcal{G}_i := \{\phi \in \Phi ; \hat\mu_{i,1}(\phi) - 2B(i, \phi, \delta) \ge \hat\mu_{i,1}(\phi^*) - 2B(i, \phi^*, \delta)\} \setminus \{\phi^*\} ,$$
$$\mathcal{B}_i := \{\phi \in \Phi ; \hat\mu_{i,1}(\phi) - 2B(i, \phi, \delta) < \hat\mu_{i,1}(\phi^*) - 2B(i, \phi^*, \delta)\} .$$

With this definition, we have the decomposition $\Phi = \mathcal{G}_i \cup \{\phi^*\} \cup \mathcal{B}_i$.
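
The partition into good and bad models can be sketched in Python. Note that this split is a device of the analysis only: it refers to the true model, which the algorithm never knows. The function name `partition_models` and the numbers are hypothetical; the lower bounds stand for $\hat\mu_{i,1}(\phi) - 2B(i, \phi, \delta)$.

```python
def partition_models(lower_bounds, true_model):
    """Split model indices into (good, bad) relative to the true model.

    A model other than the true one is good if its lower bound is at least
    the true model's lower bound, and bad if it is strictly smaller.
    """
    ref = lower_bounds[true_model]
    good = {j for j, lb in lower_bounds.items() if j != true_model and lb >= ref}
    bad = {j for j, lb in lower_bounds.items() if lb < ref}
    return good, bad

# Illustrative lower bounds (assumptions), with model 1 playing the true model.
lower_bounds = {0: 0.3, 1: 0.5, 2: 0.6, 3: 0.5}
good, bad = partition_models(lower_bounds, true_model=1)
```

Every model index falls in exactly one of the good set, the bad set, or the singleton containing the true model, mirroring the decomposition of the set of models.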

5.2.1 Regret in the exploration phase 

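Since in the exploration stage $i$ each model $\phi$ is run for $\tau_{i,1,j}$ many steps, the regret for each model
$\phi \ne \phi^*$ is bounded by $\tau_{i,1,j}\rho^*$. Now the regret for the true model is $\tau_{i,1,j}(\rho^* - \hat\mu_{i,1}(\phi^*))$, thus the
total contribution to the regret in the exploration stage $i$ is upper-bounded by

$$\Delta_{i,1} \le \tau_{i,1,j}(\rho^* - \hat\mu_{i,1}(\phi^*)) + (J - 1)\tau_{i,1,j}\rho^* . \qquad (9)$$

5.2.2 Regret in the exploitation phase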

By definition, all models in $\mathcal{G}_i \cup \{\phi^*\}$ are selected before any model in $\mathcal{B}_i$ is selected.

The good models. Let us consider some $\phi \in \mathcal{G}_i$ and an event $\Omega_i$ under which the exploitation
phase does not reset. The test (equation (4)) starts after $\tau_{i,1,j}$, thus, since there is no reset, either
$\tau_{i,2}(\phi) = \tau_{i,1,j}$, in which case the contribution to the regret is bounded by $\tau_{i,1,j}\rho^*$, or
$\tau_{i,2}(\phi) > \tau_{i,1,j}$, in which case the regret during the $(\tau_{i,2}(\phi) - 1)$ steps (where the test was successful) is
bounded by

$$(\tau_{i,2}(\phi) - 1)(\rho^* - \hat\mu_{i,2,\tau_{i,2}(\phi)-1}(\phi)) \le (\tau_{i,2}(\phi) - 1)(\rho^* - \hat\mu_{i,1}(\phi) + 2B(i, \phi, \delta))$$
$$\le (\tau_{i,2}(\phi) - 1)(\rho^* - \hat\mu_{i,1}(\phi^*) + 2B(i, \phi^*, \delta)) ,$$

and now since in the last step $\phi$ fails to pass the test, this adds a contribution to the regret of at most $\rho^*$.

We deduce that the total contribution to the regret of all the models $\phi \in \mathcal{G}_i$ in the exploitation stages
on the event $\Omega_i$ is bounded by

$$\sum_{\phi \in \mathcal{G}_i} \max\{\tau_{i,1,j}\rho^*, (\tau_{i,2}(\phi) - 1)(\rho^* - \hat\mu_{i,1}(\phi^*) + 2B(i, \phi^*, \delta)) + \rho^*\} . \qquad (10)$$

The true model. First, let us note that since the total regret of the true model during the exploitation
step $i$ is given by

$$\tau_{i,2}(\phi^*)(\rho^* - \hat\mu_{i,2,t}(\phi^*)) ,$$

then the total regret of the exploration and exploitation stages in episode $i$ on $\Omega_i$ is bounded by

$$\Delta_i \le \tau_{i,1,j}(\rho^* - \hat\mu_{i,1}(\phi^*)) + \tau_{i,1,j}(J - 1)\rho^* + \tau_{i,2}(\phi^*)(\rho^* - \hat\mu_{i,2,t}(\phi^*))$$
$$+ \sum_{\phi \in \mathcal{G}_i} \max\{\tau_{i,1,j}\rho^*, (\tau_{i,2}(\phi) - 1)(\rho^* - \hat\mu_{i,1}(\phi^*) + 2B(i, \phi^*, \delta)) + \rho^*\} + \sum_{\phi \in \mathcal{B}_i} \tau_{i,2}(\phi)\rho^* .$$

Now from the analysis provided in [6] we know that when we run UCRL2 with the true model
$\phi^*$ with parameter $\delta_i(\delta)$, then there exists an event $\Omega_{1,i}$ of probability at least $1 - \delta_i(\delta)$ such that on
this event

$$\rho^* - \hat\mu_{i,1}(\phi^*) \le B_D(\tau_{i,1,j}, \phi^*, \delta_i(\delta)) ,$$

and similarly there exists an event $\Omega_{2,i}$ of probability at least $1 - \delta_i(\delta)$, such that on this event

$$\rho^* - \hat\mu_{i,2,t}(\phi^*) \le B_D(\tau_{i,2}(\phi^*), \phi^*, \delta_i(\delta)) .$$

Now we show that, with high probability, the true model $\phi^*$ passes all the tests (equation (4)) until
the end of the episode $i$, and thus equivalently, with high probability no model $\phi \in \mathcal{B}_i$ is selected,
so that $\sum_{\phi \in \mathcal{B}_i} \tau_{i,2}(\phi) = 0$.

For the true model, after $\tau(\phi^*, t) \ge \tau_{i,1,j}$, there remain at most $(\tau_{i,2} - \tau_{i,1,j} + 1)$ possible time steps
where we do the test for the true model $\phi^*$. For each test we need to control $\hat\mu_{i,2,t}(\phi^*)$, and the event
corresponding to $\hat\mu_{i,1}(\phi^*)$ is shared by all the tests. Thus we deduce that with probability higher
than $1 - (\tau_{i,2} - \tau_{i,1,j} + 2)\delta_i(\delta)$ we have simultaneously on all time steps until the end of the exploitation
phase of stage $i$,

$$\hat\mu_{i,2,t}(\phi^*) - \hat\mu_{i,1}(\phi^*) \ge -B_D(\tau(\phi^*, t), \phi^*, \delta_i(\delta)) - B_D(\tau_{i,1,j}, \phi^*, \delta_i(\delta)) .$$
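
Now provided that $f(t_i) \ge D$, then $B_D(\tau_{i,1,j}, \phi^*, \delta_i(\delta)) \le B(i, \phi^*, \delta)$, thus the true model passes
all tests until the end of the exploitation part of stage $i$ on an event $\Omega_{3,i}$ of probability higher than
$1 - (\tau_{i,2} - \tau_{i,1,j} + 2)\delta_i(\delta)$. Since there is no reset, we can choose $\Omega_i := \Omega_{3,i}$. Note that on this
event, we thus have $\sum_{\phi \in \mathcal{B}_i} \tau_{i,2}(\phi) = 0$.

By using a union bound over the events $\Omega_{1,i}$, $\Omega_{2,i}$ and $\Omega_{3,i}$, we deduce that with probability
higher than $1 - (\tau_{i,2} - \tau_{i,1,j} + 4)\delta_i(\delta)$,

$$\Delta_i \le \tau_{i,1,j} B_D(\tau_{i,1,j}, \phi^*, \delta_i(\delta)) + [\tau_{i,1,j}(J - 1) + |\mathcal{G}_i|]\rho^* + \tau_{i,2}(\phi^*) B_D(\tau_{i,2}(\phi^*), \phi^*, \delta_i(\delta))$$
$$+ \sum_{\phi \in \mathcal{G}_i} \max\{(\tau_{i,1,j} - 1)\rho^*, (\tau_{i,2}(\phi) - 1)(B_D(\tau_{i,1,j}, \phi^*, \delta_i(\delta)) + 2B(i, \phi^*, \delta))\} .$$

Now using again the fact that $f(t_i) \ge D$, and after some simplifications, we deduce that

$$\Delta_i \le \tau_{i,1,j} B_D(\tau_{i,1,j}, \phi^*, \delta_i(\delta)) + \tau_{i,2}(\phi^*) B_D(\tau_{i,2}(\phi^*), \phi^*, \delta_i(\delta))$$
$$+ \sum_{\phi \in \mathcal{G}_i} (\tau_{i,2}(\phi) - 1) 3B(i, \phi^*, \delta) + \tau_{i,1,j}(J + |\mathcal{G}_i| - 1)\rho^* .$$

Finally, we use the fact that $\tau B_D(\tau, \phi^*, \delta_i(\delta))$ is increasing with $\tau$ to deduce the following rough
bound that holds with probability higher than $1 - (\tau_{i,2} - \tau_{i,1,j} + 4)\delta_i(\delta)$:

$$\Delta_i \le \tau_{i,2} B(i, \phi^*, \delta) + \tau_{i,2} B_D(\tau_{i,2}, \phi^*, \delta_i(\delta)) + 2J\tau_{i,1,j}\rho^* ,$$

where we used the fact that $\tau_{i,2} = \tau_{i,2}(\phi^*) + \sum_{\phi \in \mathcal{G}_i} \tau_{i,2}(\phi)$.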

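5.3 Tuning the parameters of each stage.

We now conclude by tuning the parameters of each stage, i.e. the probabilities $\delta_i(\delta)$ and the lengths
$\tau_i$, $\tau_{i,1}$ and $\tau_{i,2}$. The total length of stage $i$ is by definition

$$\tau_i = \tau_{i,1} + \tau_{i,2} = \tau_{i,1,j} J + \tau_{i,2} ,$$

where $\tau_i := 2^i$. So we set $\tau_{i,1} := \tau_i^{2/3}$ and then we have $\tau_{i,2} := \tau_i - \tau_i^{2/3}$ and $\tau_{i,1,j} = \tau_i^{2/3}/J$. Now
using these values and the definition of the bounds $B(i, \phi^*, \delta)$ and $B_D(\tau_{i,2}, \phi^*, \delta_i(\delta))$, we deduce
with probability higher than $1 - (\tau_{i,2} - \tau_{i,1,j} + 4)\delta_i(\delta)$ the following upper bound

$$\Delta_i \le 34 f(t_i) S \sqrt{A J \log\Big(\frac{\tau_i^{2/3}}{J\delta_i(\delta)}\Big)}\, \tau_i^{2/3} + 34 D S \sqrt{A \log\Big(\frac{\tau_i}{\delta_i(\delta)}\Big) \tau_i} + 2\tau_i^{2/3}\rho^* ,$$

with $t_i = 2^i - 1 + 2^{2i/3}$ and where we used the fact that $\tau_{i,2} \le \tau_i$.

We now define $\delta_i(\delta)$ such that $\delta_i(\delta) := \big(2^i - (J^{-1} + 1) 2^{2i/3} + 4\big)^{-1} 2^{-i+1} \delta$.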

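Since for the stages $i \in \mathcal{I}_0 := \{i \ge 1 ; f(t_i) < D\}$, the regret is bounded by $\Delta_i \le \tau_i\rho^*$, then the
total cumulative regret of the algorithm is bounded with probability higher than $1 - \delta$ (using the
definition of the $\delta_i(\delta)$) by

$$\Delta(T) \le \sum_{i=1}^{I(T)} \Big[ \Big(34 f(t_i) S \sqrt{J A \log\Big(\frac{2^{2i/3}}{J\delta_i(\delta)}\Big)} + 2\Big) 2^{2i/3} + 34 D S \sqrt{A \log\Big(\frac{2^i}{\delta_i(\delta)}\Big) 2^i} \Big] + \sum_{i \in \mathcal{I}_0} \tau_i\rho^* ,$$

where $t_i = 2^i - 1 + 2^{2i/3} \le T$.

We conclude by using the fact that since $I(T) \le \log_2(T + 1)$, then with probability higher than
$1 - \delta$, the following bound on the regret holds

$$\Delta(T) \le c f(T) S \big(A J \log((J\delta)^{-1}) \log_2(T)\big)^{1/2} T^{2/3} + c' D S \big(A \log(\delta^{-1}) \log_2(T) T\big)^{1/2} + c(f, D) ,$$

for some constants $c$, $c'$, and where $c(f, D) := \sum_{i \in \mathcal{I}_0} \tau_i\rho^*$. Now for the special choice $f(T) := \log_2(T + 1)$,
$i \in \mathcal{I}_0$ means $2^i + 2^{2i/3} < 2^D$, thus we must have $i < D$ and thus $c(f, D) \le 2^D$.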